Synopsis: Poor Man’s Search Engine
Let’s learn how using search for extensive collections causes an antipattern.
We'll cover the following
I was working in a technical support job in 1995, at a time when companies were just starting to adopt the Web as a way to provide information to their customers. We had a collection of short documents describing solutions to common support questions, and we wanted to put them on the Web in a knowledge base application.
As the collection grew, I quickly realized that it needed to be searchable because customers didn’t want to browse through hundreds of articles to find what they were looking for. One strategy would be to organize the articles into categories, but even these were too large, and many articles belonged in multiple categories.
We wanted our customers to search the articles, narrowing down the list to those matching any criteria. The most flexible and straightforward interface allowed the customer to enter any set of words and show them the articles in which those words appear.
An article was weighted higher if it matched the search terms more fully. Also, we wanted to match word forms. For example, a search for the word “crash” should also match “crashed”, “crashes”, and “crashing”. Of course, the search had to work in a growing collection of documents quickly enough to be useful in a web application.
If this description sounds superfluous, it shouldn’t be surprising. Searching through text online is now so common that we can hardly recall the time before it was available. But using SQL to search by keywords while also making the solution both fast and accurate is deceptively difficult.
Objective: Full-text search#
Any application that stores text needs to search for words or phrases within that text. We use databases to store more textual data than ever, and, at the same time, we demand to be able to search text that matches our criteria at ever greater speeds. Web applications especially need high-performance and scalable database techniques for searching text.
One fundamental principle of SQL (and relational theory from which SQL is derived) is that a value in a column is atomic. That is, we can compare a value to another value, but we always compare the whole value when we do that. Comparing substrings is bound to be inefficient or inaccurate in SQL.
Despite this, we need a way to compare a short string with a longer string and to find a match when the short string occurs anywhere within the longer string. How can we achieve this using SQL?
Legitimate uses of the antipattern#
We will see that the expressions in the next lesson are legal SQL queries and that they have a straightforward and simple usage. That counts for a lot.
Performance is often important, but some queries are run so infrequently that it doesn’t make sense to invest a lot of resources to optimize them. Maintaining indexes to benefit a rarely used query could be just as costly as running that query inefficiently. If the nature of a query is ad hoc, there’s no guarantee that the index we define would benefit that query anyway.
It’s hard to use pattern-matching operators for complex queries, but if we design the patterns for simple cases, they can help us get the right results with the least fuss.